The chosen dataset comes from the University of California, Irvine (UCI). The same dataset was used in a scientific article, "Using machine learning techniques to generate laboratory diagnostic pathways—a case study" by Hoffmann et al. (2018), published in the "Journal of Laboratory and Precision Medicine".
The dataset concerns patients with a liver disease, hepatitis C. It comprises a group of healthy subjects (blood donors) and a group of patients with liver disease. There are 14 columns: 4 describe the patient (sex, age, healthy/diseased and ID) and 10 contain the values of the laboratory tests performed. There are data for 615 patients.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
Reading the data¶
We start by reading in the data. It is loaded into a pandas DataFrame, which gives a convenient overview of the data.
hepatitis_c_csv = "HepatitisCdata.csv"
hepatitis_data = pd.read_csv(hepatitis_c_csv, sep=',', header=0)
hepatitis_data.head()
| | Unnamed: 0 | Category | Age | Sex | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0=Blood Donor | 32 | m | 38.5 | 52.5 | 7.7 | 22.1 | 7.5 | 6.93 | 3.23 | 106.0 | 12.1 | 69.0 |
| 1 | 2 | 0=Blood Donor | 32 | m | 38.5 | 70.3 | 18.0 | 24.7 | 3.9 | 11.17 | 4.80 | 74.0 | 15.6 | 76.5 |
| 2 | 3 | 0=Blood Donor | 32 | m | 46.9 | 74.7 | 36.2 | 52.6 | 6.1 | 8.84 | 5.20 | 86.0 | 33.2 | 79.3 |
| 3 | 4 | 0=Blood Donor | 32 | m | 43.2 | 52.0 | 30.6 | 22.6 | 18.9 | 7.33 | 4.74 | 80.0 | 33.8 | 75.7 |
| 4 | 5 | 0=Blood Donor | 32 | m | 39.2 | 74.1 | 32.6 | 24.8 | 9.6 | 9.15 | 4.32 | 76.0 | 29.9 | 68.7 |
Next, a codebook is built from the dataset to record which units are used and what the test abbreviations mean. The dataset consists solely of patient information and laboratory tests; we include all of them so that we can later examine which values may be correlated. On that basis we look at which blood values say the most about a hepatitis C infection.
codebook = {
"attribute": ["ID", "Category", "Age", "Sex", "ALB", "ALP", "ALT", "AST", "BIL", "CHE", "CHOL", "CREA", "GGT", "PROT"],
"unit": ["a.u.", "n.a.", "years", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u."],
"dtype": ["integer", "category", "integer", "category", "float", "float", "float", "float", "float", "float", "float", "float", "float", "float",],
"description": [
"Patient ID",
"Diagnosis (0=Blood Donor, 0s=suspect Blood Donor, 1=Hepatitis, 2=Fibrosis, 3=Cirrhosis)",
"Age",
"Sex (M/F)",
"Albumin",
"Alkaline phosphatase",
"Alanine aminotransferase",
"Aspartate aminotransferase",
"Bilirubin",
"Cholinesterase",
"Cholesterol",
"Creatinine",
"Gamma-glutamyltransferase",
"Protein"
]
}
pd.DataFrame(codebook).set_index("attribute")
| attribute | unit | dtype | description |
|---|---|---|---|
| ID | a.u. | integer | Patient ID |
| Category | n.a. | category | Diagnosis (0=Blood Donor, 0s=suspect Blood Don... |
| Age | years | integer | Age |
| Sex | a.u. | category | Sex (M/F) |
| ALB | a.u. | float | Albumin |
| ALP | a.u. | float | Alkaline phosphatase |
| ALT | a.u. | float | Alanine aminotransferase |
| AST | a.u. | float | Aspartate aminotransferase |
| BIL | a.u. | float | Bilirubin |
| CHE | a.u. | float | Cholinesterase |
| CHOL | a.u. | float | Cholesterol |
| CREA | a.u. | float | Creatinine |
| GGT | a.u. | float | Gamma-glutamyltransferase |
| PROT | a.u. | float | Protein |
We know how many rows and columns the dataset should have; to make sure the data were loaded completely, we check this first.
hepatitis_data.shape
(615, 14)
The Category column contains 5 distinct values: 0=Blood Donor, 0s=suspect Blood Donor, 1=Hepatitis, 2=Fibrosis and 3=Cirrhosis. The 0s=suspect Blood Donor group is small, consisting of 7 instances. Neither Kaggle nor the cited article explains what makes these patients a separate group: it is not clear what sets them apart. For this reason these instances are removed from the dataset. Adding the group to 0=Blood Donor would only enlarge the number of negatives, and adding it to 1, 2 or 3 could cause the machine learning algorithm to produce incorrect results later.
hepatitis_data = hepatitis_data[hepatitis_data['Category'] != '0s=suspect Blood Donor']
# Verify that no rows of the removed category remain.
if (hepatitis_data['Category'] == '0s=suspect Blood Donor').any():
    print("Instances with category '0s=suspect Blood Donor' are still in the DataFrame.")
else:
    print("Instances with category '0s=suspect Blood Donor' are no longer in the DataFrame.")
Instances with category '0s=suspect Blood Donor' are no longer in the DataFrame.
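Besides checking that the group is gone, it can help to look at how many instances remain per category. The sketch below does this on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the real data) and takes an explicit `.copy()` after filtering, which is one way to avoid later in-place operations acting on a view.

```python
import pandas as pd

# Toy stand-in for the real data; the labels mirror the Category values.
toy = pd.DataFrame({
    "Category": ["0=Blood Donor"] * 5
                + ["0s=suspect Blood Donor"] * 2
                + ["1=Hepatitis"] * 3,
    "ALB": [38.5, 46.9, 43.2, 39.2, 41.0, 30.1, 29.5, 35.0, 33.2, 36.8],
})

# Drop the unexplained group and take an explicit copy of the result.
filtered = toy[toy["Category"] != "0s=suspect Blood Donor"].copy()

# value_counts() shows which groups remain and how large they are.
counts = filtered["Category"].value_counts()
print(counts)
```

On the real DataFrame the same `value_counts()` call would also make the class imbalance visible before any plotting.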
Conclusion: the data have been loaded completely and are ready for the Exploratory Data Analysis.
Exploratory Data Analysis (univariate)¶
pd.DataFrame({"is na": hepatitis_data.isna().sum()}).T
| | Unnamed: 0 | Category | Age | Sex | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| is na | 0 | 0 | 0 | 0 | 1 | 18 | 1 | 0 | 0 | 0 | 10 | 0 | 0 | 1 |
This table shows how many values are missing per column. Only 5 columns have missing data: ALB, ALP, ALT, CHOL and PROT. Of these, ALB, ALT and PROT each miss just one value, CHOL misses 10 and ALP misses 18. These numbers are small relative to the 608 remaining rows.
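To judge whether the missing values are a problem, it can be useful to express them as a fraction per column and to count how many rows would survive a `dropna()`. The sketch below uses a small made-up DataFrame (`toy` is an assumption for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up values with a few gaps, only to illustrate the calls.
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 43.2, 39.2],
    "ALP": [52.5, 70.3, np.nan, np.nan],
    "AST": [22.1, 24.7, 52.6, 22.6],
})

# isna().mean() gives the fraction of missing values per column,
# because True counts as 1 and False as 0.
missing_fraction = toy.isna().mean()

# Number of rows without any missing value, i.e. what dropna() would keep.
rows_complete = len(toy.dropna())

print(missing_fraction)
print(rows_complete)
```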
hepatitis_data.describe()
| | Unnamed: 0 | Age | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 608.000000 | 608.000000 | 607.000000 | 590.000000 | 607.000000 | 608.000000 | 608.000000 | 608.000000 | 598.000000 | 608.000000 | 608.000000 | 607.000000 |
| mean | 305.363487 | 47.291118 | 41.818781 | 67.821017 | 27.601318 | 34.369408 | 11.474013 | 8.204885 | 5.378829 | 81.513158 | 38.243914 | 72.253213 |
| std | 176.981084 | 9.992705 | 5.406717 | 25.274423 | 21.227539 | 32.622442 | 19.770558 | 2.168400 | 1.119394 | 49.720652 | 51.953220 | 4.932252 |
| min | 1.000000 | 19.000000 | 20.000000 | 11.300000 | 0.900000 | 12.000000 | 1.800000 | 1.420000 | 1.430000 | 8.000000 | 4.500000 | 51.000000 |
| 25% | 152.750000 | 39.000000 | 39.000000 | 52.500000 | 16.400000 | 21.600000 | 5.300000 | 6.950000 | 4.620000 | 68.000000 | 15.700000 | 69.450000 |
| 50% | 304.500000 | 47.000000 | 42.000000 | 66.000000 | 23.000000 | 25.850000 | 7.300000 | 8.270000 | 5.300000 | 77.000000 | 23.250000 | 72.200000 |
| 75% | 456.250000 | 54.000000 | 45.250000 | 79.525000 | 32.750000 | 32.800000 | 11.300000 | 9.585000 | 6.075000 | 88.000000 | 39.200000 | 75.400000 |
| max | 615.000000 | 77.000000 | 82.200000 | 416.600000 | 258.000000 | 324.000000 | 254.000000 | 16.410000 | 9.670000 | 1079.100000 | 650.900000 | 90.000000 |
For each numeric column this table shows the count, mean, standard deviation, minimum, quartiles and maximum. Because the column 'Unnamed: 0' (the patient ID) is not relevant for the EDA, it is removed.
# Drop the first column ('Unnamed: 0', the patient ID) by position.
hepatitis_data = hepatitis_data.drop(columns=hepatitis_data.columns[0])
hepatitis_data
| | Category | Age | Sex | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0=Blood Donor | 32 | m | 38.5 | 52.5 | 7.7 | 22.1 | 7.5 | 6.93 | 3.23 | 106.0 | 12.1 | 69.0 |
| 1 | 0=Blood Donor | 32 | m | 38.5 | 70.3 | 18.0 | 24.7 | 3.9 | 11.17 | 4.80 | 74.0 | 15.6 | 76.5 |
| 2 | 0=Blood Donor | 32 | m | 46.9 | 74.7 | 36.2 | 52.6 | 6.1 | 8.84 | 5.20 | 86.0 | 33.2 | 79.3 |
| 3 | 0=Blood Donor | 32 | m | 43.2 | 52.0 | 30.6 | 22.6 | 18.9 | 7.33 | 4.74 | 80.0 | 33.8 | 75.7 |
| 4 | 0=Blood Donor | 32 | m | 39.2 | 74.1 | 32.6 | 24.8 | 9.6 | 9.15 | 4.32 | 76.0 | 29.9 | 68.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 610 | 3=Cirrhosis | 62 | f | 32.0 | 416.6 | 5.9 | 110.3 | 50.0 | 5.57 | 6.30 | 55.7 | 650.9 | 68.5 |
| 611 | 3=Cirrhosis | 64 | f | 24.0 | 102.8 | 2.9 | 44.4 | 20.0 | 1.54 | 3.02 | 63.0 | 35.9 | 71.3 |
| 612 | 3=Cirrhosis | 64 | f | 29.0 | 87.3 | 3.5 | 99.0 | 48.0 | 1.66 | 3.63 | 66.7 | 64.2 | 82.0 |
| 613 | 3=Cirrhosis | 46 | f | 33.0 | NaN | 39.0 | 62.0 | 20.0 | 3.56 | 4.20 | 52.0 | 50.0 | 71.0 |
| 614 | 3=Cirrhosis | 59 | f | 36.0 | NaN | 100.0 | 80.0 | 12.0 | 9.07 | 5.30 | 67.0 | 34.0 | 68.0 |
608 rows × 13 columns
The table now has 13 columns instead of 14. With the patient IDs removed from the dataset, we can continue.
hepatitis_data.hist(bins=20, layout=(3, 4), figsize=(16.0, 6.4));
The histograms show that Age, ALB, CHE, CHOL and PROT are reasonably normally distributed, while the remaining values are much more skewed. This may be caused by outliers.
axs = hepatitis_data.boxplot(grid=False, vert=False, figsize=(12.0, 6.0))
axs.set_title("Boxplot distributions");
The boxplot shows the distributions of the values. Some tests (CREA, for example) have high values, but clinically these may well fit the patient in question and need not be outliers. For this reason these values are kept in the rest of the EDA.
To try to correct the skewed distributions somewhat, log transformations are applied. To visualize the effect, histograms are shown again for each attribute.
hepatitis_log = np.log10(hepatitis_data.select_dtypes('number'))
hepatitis_log.hist(bins=20, layout=(3, 4), figsize=(16.0, 6.4));
The distributions are closer to normal after the log transformation. We do the same for the boxplot.
axs = hepatitis_log.boxplot(grid=False, vert=False, figsize=(12.0, 6.0))
axs.set_title("Boxplot distributions after log transformation");
Here too the data are better (more normally) distributed after the log transformation. We therefore continue with the log-transformed values for some attributes. Age, ALB, CHE, CHOL and PROT are approximately normally distributed as they are and need no correction. The attributes ALP, ALT, AST, BIL, CREA and GGT are closer to normal after log transformation, so these are transformed.
for attribute in ("ALP", "ALT", "AST", "BIL", "CREA", "GGT"):
    if attribute in codebook["attribute"]:
        newname = "log(" + attribute + ")"
        # Update the codebook entry to reflect the transformed attribute.
        index = codebook["attribute"].index(attribute)
        codebook["attribute"][index] = newname
        codebook["description"][index] = "Log10-transform of " + codebook["description"][index]
        # Rename the column and overwrite it with the log10 values.
        hepatitis_data.rename(columns={attribute: newname}, inplace=True)
        hepatitis_data[newname] = hepatitis_log[attribute]
pd.DataFrame(codebook).set_index("attribute")
| attribute | unit | dtype | description |
|---|---|---|---|
| ID | a.u. | integer | Patient ID |
| Category | n.a. | category | Diagnosis (0=Blood Donor, 0s=suspect Blood Don... |
| Age | years | integer | Age |
| Sex | a.u. | category | Sex (M/F) |
| ALB | a.u. | float | Albumin |
| log(ALP) | a.u. | float | Log10-transform of Alkaline phosphatase |
| log(ALT) | a.u. | float | Log10-transform of Alanine aminotransferase |
| log(AST) | a.u. | float | Log10-transform of Aspartate aminotransferase |
| log(BIL) | a.u. | float | Log10-transform of Bilirubin |
| CHE | a.u. | float | Cholinesterase |
| CHOL | a.u. | float | Cholesterol |
| log(CREA) | a.u. | float | Log10-transform of Creatinine |
| log(GGT) | a.u. | float | Log10-transform of Gamma-glutamyltransferase |
| PROT | a.u. | float | Protein |
As the codebook above shows, the attributes that needed it have been switched to their log-transformed values.
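The judgement that a log transformation makes a distribution "more normal" was made visually above. As a complement, the skewness can be computed before and after the transformation. The sketch below does this with NumPy on simulated right-skewed values; the `skewness` helper and the simulated sample are assumptions for illustration and are not part of the notebook's data.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(42)
# Hypothetical right-skewed measurements, loosely resembling an enzyme level.
raw = rng.lognormal(mean=3.2, sigma=0.6, size=1000)
logged = np.log10(raw)

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(skew_raw, skew_log)
```

A clearly positive value for the raw data and a value near zero after the transformation would support the visual impression from the histograms.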
axs = sns.histplot(hepatitis_data, x="Category", hue="Sex", multiple="stack", shrink=0.8)
axs.set_title("Distribution of diagnoses by sex");
plt.xticks(rotation=45)
plt.show()
The bar chart of diagnoses by sex shows that the dataset is imbalanced. The healthy group (0=Blood Donor) contains far more instances than the affected groups (1=Hepatitis, 2=Fibrosis and 3=Cirrhosis).
Exploratory Data Analysis (bivariate)¶
sns.pairplot(hepatitis_data, hue="Sex");
Some information can be drawn from the pairplot above. log(CREA) is the only value with deviating plots: its point clouds are mostly horizontal or vertical. ALB and PROT appear to be weakly related, which makes sense because albumin is the most abundant protein in blood. log(AST) and log(GGT) also appear to be related. The remaining values show no clear correlation.
hepatitis_without_category = hepatitis_data.drop(columns=['Category', 'Sex'])
axs = sns.heatmap(hepatitis_without_category.corr(), annot=True, cmap="coolwarm", vmin=-1.0, vmax=1.0, square=True)
axs.set_title("Pairwise correlations ($R$)");
This heatmap clearly shows which values correlate with each other and which do not at all. Age correlates weakly with log(GGT), CHOL and log(ALP). Between log(GGT) and log(AST) there is indeed correlation, as was visible in the pairplot. PROT, CHE and log(ALT) all three correlate slightly with each other.
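Reading the strongest relations off a heatmap by eye works for a dozen attributes, but the same information can also be listed as a ranked table. The sketch below takes the upper triangle of a correlation matrix and sorts the pairs by absolute correlation. The columns A, B and C are simulated and purely hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "A": base,
    "B": base + rng.normal(scale=0.3, size=n),  # strongly related to A
    "C": rng.normal(size=n),                    # unrelated noise
})

corr = toy.corr()  # Pearson correlation by default

# Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once.
rows, cols = np.triu_indices(len(corr.columns), k=1)
labels = [f"{corr.columns[r]}-{corr.columns[c]}" for r, c in zip(rows, cols)]
pairs = pd.Series(corr.to_numpy()[rows, cols], index=labels)

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```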
EDA conclusion¶
To close the Exploratory Data Analysis, a short conclusion follows. The various plots show that the data may contain outliers, for example a creatinine (CREA) above 1000. Such a value can nevertheless be clinically correct, for instance because of an underlying condition. Because that information is not available, these outliers are not filtered out. Because it is not entirely clear what the group '0s=suspect Blood Donor' means or how it differs from the other groups, and because it contains few instances, it was decided to remove these instances and thereby the group as a whole. Merging it with another group could lead to incorrect predictions at a later stage, so that option was not chosen. Finally, there appears to be little correlation between the attributes in the dataset. PROT and ALB do correlate, but this is probably because albumin is the most abundant protein in the bloodstream. Other attributes correlate weakly or not at all.
The data as they stand at the end of the EDA are good enough to take on to the next step: Machine Learning.
Machine Learning¶
Because the classes are imbalanced (group 0 is much larger than groups 1, 2 and 3), we use SMOTE to create extra instances. This requires the NaN values to be removed first.
hepatitis_data = hepatitis_data.dropna(axis='rows')
hepatitis_data.shape
(582, 13)
We select the age, sex and laboratory attributes as X and set the diagnosis as target y, and convert both to NumPy arrays. Sex is recoded from f and m to 0.0 and 1.0 respectively, to avoid nominal values. Note that the slice 1:-1 leaves out the last column (PROT), so X has 11 features. We also set the dtype of X to float, because SMOTE works with numeric input.
pd.set_option('future.no_silent_downcasting', True)
X = hepatitis_data.iloc[:, 1:-1].replace({"f": 0.0, "m": 1.0}).to_numpy()
y = hepatitis_data.iloc[:, 0].to_numpy()
X = X.astype(float)
X.shape, X.dtype, y.shape, y.dtype
((582, 11), dtype('float64'), (582,), dtype('O'))
Before looking at which model is most suitable, we first make the class distribution more even, using SMOTE. Because the test data should not be altered, the data are first split into training and test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
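With strongly imbalanced classes, a purely random split can leave very few instances of a small class in the test set. One option, not used in this notebook, is to pass `stratify=y` to `train_test_split` so that the class proportions are kept in both parts. The sketch below illustrates this on made-up labels (the arrays `X` and `y` here are hypothetical and unrelated to the hepatitis data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 instances of class "a", 10 of class "b".
y = np.array(["a"] * 90 + ["b"] * 10)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y the 90/10 ratio is preserved in both parts.
train_minority = int((y_tr == "b").sum())
test_minority = int((y_te == "b").sum())
print(train_minority, test_minority)
```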
Now that the data are split into training and test data, we apply SMOTE.
from imblearn.over_sampling import SMOTE
sample_ratios = {
"0=Blood Donor": 500,
"1=Hepatitis": 500,
"2=Fibrosis": 500,
"3=Cirrhosis": 500
}
smote = SMOTE(sampling_strategy=sample_ratios, k_neighbors=5, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
A dict is passed to sampling_strategy so that we can choose the ratio between the 4 categories ourselves. Using .shape we check whether there really are more instances.
X_train_smote.shape, y_train_smote.shape
((2000, 11), (2000,))
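To give an idea of where the extra instances come from: SMOTE creates synthetic points by interpolating between a minority-class sample and one of its nearest neighbours from the same class. The NumPy sketch below illustrates only that interpolation step. The function `smote_like_samples` and the toy points are assumptions for illustration. This is not the imbalanced-learn implementation, and it assumes distinct points and k smaller than the number of samples.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    # For distinct points, position 0 of each sorted row is the point itself, so skip it.
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    base = rng.integers(0, len(minority), size=n_new)
    pick = rng.integers(0, neighbours.shape[1], size=n_new)
    chosen = neighbours[base, pick]
    gap = rng.random((n_new, 1))  # position along the connecting line segment
    return minority[base] + gap * (minority[chosen] - minority[base])

minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                     [1.2, 2.5], [1.8, 2.4], [1.4, 2.1]])
synthetic = smote_like_samples(minority, n_new=10, k=3)
print(synthetic.shape)
```

Because every synthetic point lies on a line segment between two existing points, the new instances stay inside the region already covered by the minority class.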
As .shape shows, the dataset has indeed been expanded: each category now has 500 instances. From here we can look for a machine learning algorithm that suits the dataset. Ten models are selected (including a DummyClassifier as baseline), and each is evaluated with cross_validate to see how well it suits this dataset.
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
models = [
DecisionTreeClassifier,
DummyClassifier,
GaussianNB,
KNeighborsClassifier,
RandomForestClassifier,
LinearDiscriminantAnalysis,
QuadraticDiscriminantAnalysis,
SVC,
LogisticRegression,
AdaBoostClassifier
]
from sklearn.model_selection import cross_validate
metric_scores = {}
for model in models:
    # Default 5-fold cross-validation with default hyperparameters.
    scores = cross_validate(model(), X_train_smote, y_train_smote, return_train_score=True)
    # Average each metric over the folds.
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{model.__name__}"] = scores
pd.DataFrame(metric_scores).T
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\discriminant_analysis.py:935: UserWarning: Variables are collinear
warnings.warn("Variables are collinear")
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
warnings.warn(
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| DecisionTreeClassifier | 0.017285 | 0.000805 | 0.9785 | 1.000000 |
| DummyClassifier | 0.000397 | 0.000599 | 0.2500 | 0.250000 |
| GaussianNB | 0.002633 | 0.000617 | 0.9580 | 0.961500 |
| KNeighborsClassifier | 0.003051 | 0.023944 | 0.9095 | 0.937375 |
| RandomForestClassifier | 0.564807 | 0.008219 | 0.9985 | 1.000000 |
| LinearDiscriminantAnalysis | 0.005807 | 0.001002 | 0.9520 | 0.952875 |
| QuadraticDiscriminantAnalysis | 0.002763 | 0.001300 | 1.0000 | 1.000000 |
| SVC | 0.112474 | 0.104457 | 0.6985 | 0.702500 |
| LogisticRegression | 0.128176 | 0.001809 | 0.9665 | 0.965625 |
| AdaBoostClassifier | 0.406707 | 0.014208 | 0.4055 | 0.397125 |
cross_validate reports, among other things, the test and train scores per model. On this dataset RandomForestClassifier, DecisionTreeClassifier, KNeighborsClassifier, LinearDiscriminantAnalysis, LogisticRegression and GaussianNB score very well. RandomForestClassifier and DecisionTreeClassifier both score 1.000000 on the training data, and QuadraticDiscriminantAnalysis does so on both training and test data, which seems 'too good to be true' and may point to overfitting. These models are therefore passed over. Note also that the folds are drawn from data that were already oversampled, so synthetic neighbours of the same original instance can end up in different folds, which may inflate all of these scores. fit_time and score_time are not taken into account when deciding whether to take an algorithm to the next step.
LinearDiscriminantAnalysis and SVC were included as a test; beforehand it was expected that these types of model would not suit this dataset well. For SVC this holds, although scores of around 70% are not very bad. For LinearDiscriminantAnalysis it does not: this model scores much higher than expected.
We continue with LinearDiscriminantAnalysis, LogisticRegression and GaussianNB.
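The ConvergenceWarning for LogisticRegression suggests increasing max_iter or scaling the data. One way to do both is to wrap a StandardScaler and the classifier in a pipeline, so that the scaling is fitted on each training fold only. The sketch below shows this on synthetic data. `X_demo`, `y_demo` and the choice `max_iter=1000` are assumptions for illustration, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the resampled training data (4 classes, 11 features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6, n_redundant=2,
    n_classes=4, n_clusters_per_class=1, class_sep=1.5, random_state=42,
)
X_demo[:, 0] *= 1000.0  # exaggerate one feature's scale, as lab values can differ widely

# The scaler is refitted inside every cross-validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_validate(model, X_demo, y_demo, cv=5, return_train_score=True)

mean_test = scores["test_score"].mean()
mean_train = scores["train_score"].mean()
print(mean_test, mean_train)
```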
from sklearn.feature_selection import SelectKBest
metric_scores = {}
k = 11
while k:
    X_select = SelectKBest(k=k).fit_transform(X_train_smote, y_train_smote)
    # NOTE: X_select is not passed on; cross_validate still receives all 11
    # features, which is why the scores in the output do not depend on k.
    scores = cross_validate(GaussianNB(), X_train_smote, y_train_smote, return_train_score=True)
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{k} features"] = scores
    k -= 1
pd.DataFrame(metric_scores).T
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 11 features | 0.002639 | 0.000763 | 0.958 | 0.9615 |
| 10 features | 0.002233 | 0.000985 | 0.958 | 0.9615 |
| 9 features | 0.003296 | 0.000997 | 0.958 | 0.9615 |
| 8 features | 0.003479 | 0.001119 | 0.958 | 0.9615 |
| 7 features | 0.003272 | 0.000997 | 0.958 | 0.9615 |
| 6 features | 0.003287 | 0.001009 | 0.958 | 0.9615 |
| 5 features | 0.003494 | 0.001123 | 0.958 | 0.9615 |
| 4 features | 0.002597 | 0.000997 | 0.958 | 0.9615 |
| 3 features | 0.002957 | 0.001036 | 0.958 | 0.9615 |
| 2 features | 0.003306 | 0.001098 | 0.958 | 0.9615 |
| 1 features | 0.003008 | 0.001400 | 0.958 | 0.9615 |
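The test and train scores in this table are identical for every k because the selected features (X_select) are never passed to cross_validate. A possible way to actually evaluate the effect of k is to put SelectKBest and the classifier in one pipeline, so that the selection is refitted within each fold. The sketch below shows this on synthetic data. `X_demo` and `y_demo` are assumptions for illustration, and the resulting numbers say nothing about the hepatitis dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 11 features, of which only 4 carry class information.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=1.5, random_state=42,
)

results = {}
for k in range(X_demo.shape[1], 0, -1):
    # Selection happens inside the pipeline, so it is refitted on each
    # training fold and the classifier only sees the k selected columns.
    pipeline = make_pipeline(SelectKBest(k=k), GaussianNB())
    scores = cross_validate(pipeline, X_demo, y_demo, cv=5, return_train_score=True)
    results[f"{k} features"] = {name: values.mean() for name, values in scores.items()}

summary = pd.DataFrame(results).T
print(summary[["test_score", "train_score"]])
```

With the selection inside the pipeline, the scores are expected to change with k, which is the signal this experiment was looking for.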
metric_scores = {}
k = 11
while k:
    X_select = SelectKBest(k=k).fit_transform(X_train_smote, y_train_smote)
    # NOTE: X_select is not passed on; cross_validate still receives all 11
    # features, which is why the scores in the output do not depend on k.
    scores = cross_validate(LinearDiscriminantAnalysis(), X_train_smote, y_train_smote, return_train_score=True)
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{k} features"] = scores
    k -= 1
pd.DataFrame(metric_scores).T
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 11 features | 0.006229 | 0.000916 | 0.952 | 0.952875 |
| 10 features | 0.110903 | 0.001026 | 0.952 | 0.952875 |
| 9 features | 0.005572 | 0.001362 | 0.952 | 0.952875 |
| 8 features | 0.005397 | 0.001005 | 0.952 | 0.952875 |
| 7 features | 0.004206 | 0.001099 | 0.952 | 0.952875 |
| 6 features | 0.005830 | 0.001047 | 0.952 | 0.952875 |
| 5 features | 0.006309 | 0.001005 | 0.952 | 0.952875 |
| 4 features | 0.005009 | 0.001015 | 0.952 | 0.952875 |
| 3 features | 0.005154 | 0.000286 | 0.952 | 0.952875 |
| 2 features | 0.004716 | 0.000999 | 0.952 | 0.952875 |
| 1 features | 0.005148 | 0.000202 | 0.952 | 0.952875 |
metric_scores = {}
k = 11
while k:
    X_select = SelectKBest(k=k).fit_transform(X_train_smote, y_train_smote)
    # NOTE: X_select is not passed on; cross_validate still receives all 11
    # features, so the scores cannot depend on k.
    scores = cross_validate(LogisticRegression(), X_train_smote, y_train_smote, return_train_score=True)
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{k} features"] = scores
    k -= 1
pd.DataFrame(metric_scores).T
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 11 features | 0.036723 | 0.000599 | 0.9665 | 0.965625 |
| 10 features | 0.039044 | 0.000825 | 0.9665 | 0.965625 |
| 9 features | 0.038713 | 0.000399 | 0.9665 | 0.965625 |
| 8 features | 0.049767 | 0.001038 | 0.9665 | 0.965625 |
| 7 features | 0.043265 | 0.001748 | 0.9665 | 0.965625 |
| 6 features | 0.056151 | 0.000899 | 0.9665 | 0.965625 |
| 5 features | 0.051352 | 0.001102 | 0.9665 | 0.965625 |
| 4 features | 0.056642 | 0.001199 | 0.9665 | 0.965625 |
| 3 features | 0.043280 | 0.001024 | 0.9665 | 0.965625 |
| 2 features | 0.047585 | 0.001111 | 0.9665 | 0.965625 |
| 1 features | 0.042775 | 0.000399 | 0.9665 | 0.965625 |
For all three models, 2 features work just as well as 11 features. The LogisticRegression model scores best on this dataset, so from here on we continue with LogisticRegression only. The next step is to determine which features are most suitable for the prediction.
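The cross-validation runs above repeatedly report that lbfgs did not converge, and the warning itself suggests scaling the data or raising `max_iter`. The following is a minimal sketch of that idea, not the notebook's own code: it assumes synthetic data from `make_classification` in place of the hepatitis data, and the names `X_demo`, `y_demo` and `scaled_logreg` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab values (assumption: NOT the hepatitis data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=4, random_state=0
)
# Give the columns very different ranges, as raw lab values often have.
X_demo = X_demo * np.logspace(0, 3, num=11)

# The scaler sits inside the pipeline, so every CV fold is scaled
# using only its own training part.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_results = cross_validate(
    scaled_logreg, X_demo, y_demo, cv=5, return_train_score=True
)
mean_test_score = cv_results["test_score"].mean()
mean_train_score = cv_results["train_score"].mean()
print(f"test: {mean_test_score:.3f}, train: {mean_train_score:.3f}")
```

Putting the scaler in the pipeline rather than scaling the full dataset up front keeps information from the validation folds out of the training step.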
from sklearn.feature_selection import RFE
# Initialize the estimator that RFE uses to rank the features
estimator = LogisticRegression()
# Initialize RFE and keep the 2 highest-ranked features (adjust as needed)
rfe = RFE(estimator, n_features_to_select=2)
# Fit RFE on the SMOTE-resampled training data
rfe.fit(X_train_smote, y_train_smote)
# Collect the column indices of the selected features
selected_features = np.where(rfe.support_)[0].tolist()
selected_features
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
[4, 5]
Columns 4 and 5 are therefore the most suitable for predicting which form of hepatitis a patient has. These are ALP and ALT. Time to fit the model.
X_train_smote_df = pd.DataFrame(X_train_smote)
y = y_train_smote
X = X_train_smote_df.iloc[:, 4:6]
model = LogisticRegression()
model.fit(X, y)
LogisticRegression()
Because I am genuinely curious whether the confusion matrix changes when all features are included, I train a second model that uses all features instead of only 2.
y = y_train_smote
X_two = X_train_smote_df
model_two = LogisticRegression()
model_two.fit(X_two, y)
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression()
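The ConvergenceWarning above suggests two remedies: scaling the data or raising `max_iter`. A minimal sketch of both, assuming synthetic stand-in data from `make_classification` (not the hepatitis data); the names `X_demo`, `y_demo` and `scaled_model`, the column scale factors and the value `max_iter=1000` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the hepatitis data): 4 classes, 12 features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
# Blow up some columns to mimic lab values measured on very different scales.
X_demo = X_demo * np.array([1, 1, 100, 100, 1000, 1, 1, 10, 1, 1, 500, 1])

# Standardize every feature first, then fit the logistic regression.
# max_iter=1000 is an arbitrary, generous iteration cap (assumption).
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ holds the number of lbfgs iterations actually used.
n_iter = scaled_model.named_steps["logisticregression"].n_iter_
print("iterations used:", n_iter.max())
```

In the notebook itself the same pipeline could be fitted on `X_two` and `y` instead of the demo data. Whether that changes the confusion matrix would have to be checked.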
The fit completed, although the second model again raised a ConvergenceWarning. On to the final steps: predict, confusion matrix and ROC curve.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Predict on the same SMOTE-resampled training data the model was fitted on
prediction = model.predict(X)
conf_mat = confusion_matrix(y, prediction)
# The labels from unique() must be in the same (sorted) order as the rows of conf_mat
disp = ConfusionMatrixDisplay(conf_mat, display_labels=hepatitis_data['Category'].unique())
disp.plot(cmap='Blues', include_values=True, xticks_rotation='vertical', values_format='d');
This confusion matrix shows that the model is best at predicting the classes '0=Blood Donor' and '3=Cirrhosis'. For class 0, 492 of the 500 samples are classified correctly and only 8 are misclassified: 7 as '1=Hepatitis' and 1 as '2=Fibrosis'. For '3=Cirrhosis', 471 samples are classified correctly, and 29 misclassified samples end up in the category '2=Fibrosis'. Categories 1 and 2 do somewhat worse, with 374 and 392 correctly classified patients respectively.
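To make the counts above easier to compare, they can be turned into per-class recall. This is a small sketch that assumes SMOTE balanced every class to 500 training samples; the text states 500 explicitly only for '0=Blood Donor', so `samples_per_class` is an assumption for the other classes.

```python
# Correctly classified counts per class, as quoted in the text above.
correct = {
    "0=Blood Donor": 492,
    "1=Hepatitis": 374,
    "2=Fibrosis": 392,
    "3=Cirrhosis": 471,
}

# Assumption: SMOTE balanced every class to 500 training samples.
samples_per_class = 500

# Recall per class = correctly classified / all samples of that class.
recall = {label: n / samples_per_class for label, n in correct.items()}
for label, value in recall.items():
    print(f"{label}: recall = {value:.3f}")

# Overall accuracy = all correct predictions / all samples.
overall_accuracy = sum(correct.values()) / (samples_per_class * len(correct))
print(f"overall accuracy = {overall_accuracy:.4f}")
```

Under that assumption, the gap between classes 0 and 3 on one side and classes 1 and 2 on the other becomes directly visible as a difference in recall.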
Now I do exactly the same for the second model.
# Same evaluation as above, now for the model trained on all features
prediction_two = model_two.predict(X_two)
conf_mat = confusion_matrix(y, prediction_two)
disp = ConfusionMatrixDisplay(conf_mat, display_labels=hepatitis_data['Category'].unique())
disp.plot(cmap='Blues', include_values=True, xticks_rotation='vertical', values_format='d');
This confusion matrix looks quite different from the first one. With all features, the model recognizes categories 2 and 3 perfectly. It also does very well on categories 0 and 1, but makes some errors there. For '0=Blood Donor' it makes slightly more errors than in the first confusion matrix: 14 samples end up in category 1, 6 in category 2 and 4 in category 3. '1=Hepatitis' also has some misclassifications, but almost 100 fewer than in the first model. Using all features therefore appears to work better than using only the 2 features of the first model.
On to the ROC curve.
from sklearn.metrics import roc_curve, roc_auc_score
y_prob = model_two.predict_proba(X_two)
y_true = y_train_smote
scores = {"label": [], "AUC": []}
plt.figure(figsize=(6.4, 6.4))
plt.plot([0, 1], [0, 1], ":k")
for index, label in enumerate(hepatitis_data["Category"].unique()):
    y_label = (y_true == label).astype(int) # Get binary labels for the current category
    fpr, tpr, _ = roc_curve(y_label, y_prob[:, index])
    scores["label"].append(label)
    scores["AUC"].append(roc_auc_score(y_label, y_prob[:, index]))
    plt.plot(fpr, tpr, label=label)
plt.axis("square")
plt.grid(True)
plt.title("ROC-curve")
plt.legend()
plt.show()
pd.DataFrame(scores).set_index("label")
| label | AUC |
|---|---|
| 0=Blood Donor | 0.996131 |
| 1=Hepatitis | 0.996791 |
| 2=Fibrosis | 0.998600 |
| 3=Cirrhosis | 0.999859 |
I used the second model for the ROC curve because it performs better than the first model, which uses fewer features. On the SMOTE-resampled training data, this model classifies all 4 categories almost perfectly (AUC > 0.99 for each category).
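As a reading aid for the AUC values: the one-vs-rest AUC of a category equals the probability that a randomly chosen sample of that category gets a higher predicted probability than a randomly chosen sample of any other category, with ties counted as one half. The sketch below computes this directly in plain Python. The helper `one_vs_rest_auc` and the toy values `y_toy` and `p_hepatitis` are hypothetical and are not taken from the notebook's data.

```python
def one_vs_rest_auc(y_true, scores, positive_label):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == positive_label]
    neg = [s for y, s in zip(y_true, scores) if y != positive_label]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels and predicted probabilities for '1=Hepatitis'.
y_toy = ["0=Blood Donor", "1=Hepatitis", "1=Hepatitis", "2=Fibrosis", "3=Cirrhosis"]
p_hepatitis = [0.10, 0.80, 0.30, 0.40, 0.05]

auc_toy = one_vs_rest_auc(y_toy, p_hepatitis, "1=Hepatitis")
print(f"toy AUC for '1=Hepatitis': {auc_toy:.3f}")
```

In the toy data, 5 of the 6 positive/negative pairs are ranked correctly, so the AUC is 5/6. An AUC above 0.99 means that almost every such pair is ranked correctly.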